Abstract

We study learning of probability distributions characterized by an unknown symmetry
direction. Based on an entropic performance measure and the variational method of
statistical mechanics, we develop exact upper and lower bounds on the scaled critical
number of examples below which learning of the direction is impossible. The asymptotic
tightness of the bounds suggests an asymptotically optimal method for learning
nonsmooth distributions.

PACS numbers: 87.10.+e, 05.20.+m, 02.50.-r

In recent years, methods of statistical physics have contributed important insights into
the theory of learning with neural networks and other learning machines (see e.g. [1], [?], [?]).
Among the most prominent discoveries of statistical mechanics in this field is the occurrence
of phase transitions in the progress of learning, when the number of presented example data
is gradually increased and the dimensionality of the data and of the model parameters is large.

Besides being ubiquitous in discrete parameter models, phase transitions are typically
observed when the learning problem contains symmetries which are spontaneously broken
when the scaled number of examples increases beyond a critical value [?], [?], [?]. Although
phase transitions in neural networks have been analysed extensively by the method of replicas,
it is usually hard to present a rigorous analysis (for an exception see e.g. [?] and the recent
attempts of Talagrand [?]). This often hinders the reception of the interesting results
by researchers outside the community of statistical physicists working on disordered systems.
Unfortunately, other standard techniques based on asymptotic expansions do not apply in
these cases either: they are valid only when the number of data is much larger than the
number of parameters.






In this letter we present a rigorous and simple approach to these problems. We
combine information theoretic bounds for the performance of statistical estimators (see e.g.
[11], [12]) with an elementary variational principle of statistical physics [?]. This
allows us to compute rigorous upper and lower bounds for the critical number of examples at
which a transition occurs.

We explain our method for the case of retarded unsupervised learning, which has
been analysed before using the replica framework (see e.g. [?], [15], [16], [17]). The goal of
unsupervised learning is to find a nontrivial structure in a set of data which reflects the
properties of the underlying data generating mechanism rather than being an artefact of
statistical fluctuations. The phenomenon of retarded learning describes the fact that for some
high dimensional probability distributions it is impossible to predict the underlying
structure (usually a symmetry axis) if the (scaled) number of data is below a certain critical
value. Only above this value can estimation of the structure start.



We adopt a probabilistic, Bayesian formulation of unsupervised learning following [?],
[10], [15]. We model a situation where the probability distribution of the data is characterized
by a single unknown rotational-symmetry direction $w^*$. More specifically, we assume that
a set of $t$ data $x^t = x_1, \ldots, x_t$ has been generated independently by $t$ samplings from a
distribution of the form

$$P(x|w^*) = P_0(x)\,\exp(-V(\lambda)) \tag{1}$$

where $P_0(x) = (2\pi)^{-N/2}\exp(-x^2/2)$ is a spherical Gaussian distribution and
$\lambda = w^*\cdot x$ is the projection of the $N$-dimensional data vector $x$ on the direction
defined by $w^*$. The distribution of the projection is given by
$p(\lambda) = \frac{1}{\sqrt{2\pi}}\exp(-\lambda^2/2 - V(\lambda))$. In the following,
averages with respect to $p(\lambda)$ will be denoted with an overline,
$\overline{(\ldots)} = \int d\lambda\, p(\lambda)\,(\ldots)$.
Based on the set of data $x^t$, the goal of a learner is to produce an estimate $\hat P_t(x|x^t)$ for
the true distribution $P(x|w^*)$. $\hat P_t$ will not necessarily belong to the given parametric class
(1). Our approach relies very much on the choice of a specific measure for the quality of the
estimation. Rather than computing the overlap between an estimated direction and the true
$w^*$, we choose a quantity which directly measures our ability to compress the data based
on the information we have gained on the structure of $P$. This is related to the averaged
relative entropy (Kullback-Leibler (KL) divergence) between the true distribution and the
estimate

$$L(\hat P_t, w^*) = \int dx^t\, P(x^t|w^*) \int dx\, P(x|w^*)\,\ln\frac{P(x|w^*)}{\hat P_t(x|x^t)} \tag{2}$$

where $P(x^t|w^*)$ is shorthand for the product distribution $\prod_{i=1}^{t} P(x_i|w^*)$ and
$dx^t = \prod_{i=1}^{t} dx_i$. We further adopt a Bayesian approach where we assume that "nature"
draws the true parameter $w^*$ at random from a (noninformative) prior distribution $p(w^*)$ with
associated measure $d\mu(w^*) = p(w^*)\,dw^*$ given by the uniform distribution on the unit sphere
$\|w^*\|^2 = 1$. The case of a discrete prior will be discussed later.

The progress of the learning will be measured by the cumulative risk defined by

$$R_m(\hat P) = \sum_{t=0}^{m-1} \int d\mu(w^*)\, L(\hat P_t, w^*) \tag{3}$$



This measure of loss has a variety of important applications in information theory, game
theory and mathematical finance (see e.g. [?], [?]). For example, it is proportional to the
expected number of extra bits (assuming a reasonable quantization of the $x_i$) that we have to
spend in compressing the observed data when their distribution is unknown and a sequential
estimate is used instead [?]. As we will see in a moment, it also has an important meaning in
statistical physics.
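The compression interpretation rests on the chain rule: the code length accumulated by predicting each symbol with the current Bayes estimate equals the code length of the prior-averaged likelihood of the whole sequence. The following minimal sketch checks this on a toy problem that is not part of the paper: a binary sequence and a hypothetical finite class of three coin biases (`thetas`, `prior` and `theta_true` are illustrative assumptions).

```python
import math

def sequential_code_length(xs, thetas, prior):
    """Sum of -ln of the sequential Bayes predictions along the sequence xs."""
    weights = list(prior)  # unnormalised posterior weights, starting at the prior
    total = 0.0
    for x in xs:
        likes = [th if x == 1 else 1.0 - th for th in thetas]
        # predictive probability of the next symbol under the current posterior
        pred = sum(w * l for w, l in zip(weights, likes)) / sum(weights)
        total += -math.log(pred)
        # Bayes update: multiply each weight by the likelihood of the observed symbol
        weights = [w * l for w, l in zip(weights, likes)]
    return total

def mixture_code_length(xs, thetas, prior):
    """-ln of the prior-averaged likelihood of the whole sequence."""
    k, n = sum(xs), len(xs)
    return -math.log(sum(p * th**k * (1.0 - th) ** (n - k)
                         for p, th in zip(prior, thetas)))

thetas = [0.2, 0.5, 0.8]           # hypothetical finite hypothesis class
prior = [1 / 3, 1 / 3, 1 / 3]      # uniform prior over the class
xs = [1, 1, 0, 1, 0, 1, 1, 1]      # an arbitrary binary sequence
theta_true = 0.8                   # assumed true bias

k, n = sum(xs), len(xs)
ideal = -(k * math.log(theta_true) + (n - k) * math.log(1.0 - theta_true))
seq = sequential_code_length(xs, thetas, prior)
mix = mixture_code_length(xs, thetas, prior)
extra_bits = (seq - ideal) / math.log(2.0)  # extra code length in bits
print(seq, mix, extra_bits)
```

Averaging the extra code length over sequences and over the prior gives the toy analogue of the cumulative risk (3). For a finite class containing the true parameter the extra length can never exceed the negative log prior weight of that parameter.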

A first attempt to study retarded learning by using bounds on (3) was undertaken in
[?]. However, the bounds were too weak to give a nonzero bound on the critical number of
examples below which learning is impossible.

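An elementary calculation shows that the predictive distribution
$\hat P_t^{\rm Bayes}(x|x^t) = \int d\mu(w)\, p(w|x^t)\, P(x|w)$, where $p(w|x^t) \propto P(x^t|w)$
is the posterior density, yields the minimum risk $R_m^{\rm Bayes} = R_m(\hat P^{\rm Bayes})$ over
all choices of estimators. Inserting this estimator into (3) and using (1) we get

$$R_m^{\rm Bayes} = -\int d\mu(w^*)\, dx^m\, P(x^m|w^*)\, \ln \int d\mu(w)\, e^{-\sum_{i=1}^{m}\{V(w\cdot x_i) - V(w^*\cdot x_i)\}} \tag{4}$$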



The right hand side looks very much like an averaged free energy in statistical mechanics for a
system with Hamiltonian $\sum_i \{V(w\cdot x_i) - V(w^*\cdot x_i)\}$. Hence we can expect that useful
bounds for this quantity can be derived using the standard variational principle of statistical
mechanics [?] for the free energy

$$-\ln \int d\mu(w)\, e^{-H(w)} \le -\ln \int d\mu(w)\, e^{-H_0(w)} + \langle H - H_0 \rangle_0 \tag{5}$$



which bounds the free energy of a system with Hamiltonian $H$ in terms of the free
energy of a trial Hamiltonian $H_0$ plus a correction term. The brackets
$\langle(\ldots)\rangle_0 = \int d\mu(w)\, e^{-H_0(w)}(\ldots) \,/ \int d\mu(w')\, e^{-H_0(w')}$
denote an average with respect to the Gibbs distribution defined by $H_0$. Using appropriate
choices for $H$ and $H_0$, we will get both upper and lower bounds on (4).
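The variational inequality can be checked numerically on a small discrete system. The sketch below is an illustration under our own assumptions, not part of the paper: the measure is uniform on `K` states and both Hamiltonians are random vectors. It verifies that the free energy of $H_0$ plus the correction term never falls below the free energy of $H$, with equality when $H_0 = H$.

```python
import numpy as np

def free_energy(H, mu):
    """-ln sum_w mu(w) exp(-H(w)) for a discrete measure mu."""
    return -np.log(np.sum(mu * np.exp(-H)))

def variational_bound(H, H0, mu):
    """Free energy of the trial Hamiltonian H0 plus the correction <H - H0>_0."""
    gibbs = mu * np.exp(-H0)
    gibbs = gibbs / gibbs.sum()          # Gibbs distribution defined by H0
    return free_energy(H0, mu) + np.sum(gibbs * (H - H0))

rng = np.random.default_rng(0)
K = 50                                   # number of states (illustrative choice)
mu = np.full(K, 1.0 / K)                 # uniform prior measure
H = rng.normal(size=K)                   # "true" Hamiltonian

# gap = bound - exact free energy, for many random trial Hamiltonians
gaps = np.array([
    variational_bound(H, rng.normal(size=K), mu) - free_energy(H, mu)
    for _ in range(1000)
])
tight = variational_bound(H, H.copy(), mu) - free_energy(H, mu)
print(gaps.min(), tight)
```

The gap is the KL divergence between the two Gibbs distributions, which is why it is nonnegative for every trial Hamiltonian.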

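We begin with the lower bound. We set $H_0 = \sum_i \{V(w\cdot x_i) - V(w^*\cdot x_i)\}$ and
$H = \sum_i \{\lambda V(w\cdot x_i) - \gamma V(w^*\cdot x_i)\}$, where $\lambda, \gamma \ge 0$ are
variational parameters (here $\lambda$ is a free parameter, not the projection $w^*\cdot x$).
Averaging both sides of (5) over $P(x^m|w^*)$ and $p(w^*)$, and using Jensen's inequality in the
second line, we derive the lower bound on $R_m^{\rm Bayes}$

$$\begin{aligned}
R_m^{\rm Bayes} &\ge -\int d\mu(w^*)\, dx^m\, P(x^m|w^*)\, \ln \int d\mu(w)\, e^{-H} + m(\gamma - \lambda)\overline{V} \\
&\ge -\int d\mu(w^*)\, \ln \int d\mu(w) \left[ \int dx\, P_0(x)\, e^{-\lambda V(x\cdot w)}\, e^{-(1-\gamma) V(x\cdot w^*)} \right]^m + m(\gamma - \lambda)\overline{V} \\
&= -\ln \int_{-1}^{1} dq\, W_N(q)\, [F_{\lambda,\gamma}(q)]^m + m(\gamma - \lambda)\overline{V}
\end{aligned} \tag{6}$$

where $W_N(q) = \int d\mu(w)\, \delta(q - w\cdot w^*) \propto (1 - q^2)^{(N-3)/2}$ and

$$F_{\lambda,\gamma}(q) = \int Dx \int Dy\; e^{-\lambda V(x) - (1-\gamma) V(xq + y\sqrt{1-q^2})}$$

with the Gaussian measure $Dx = e^{-x^2/2}\, dx/\sqrt{2\pi}$. This bound holds for every $N$ and every $m$.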
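For a concrete potential the double Gaussian integral $F_{\lambda,\gamma}(q)$ is easy to evaluate by Gauss-Hermite quadrature. The sketch below is illustrative and rests on our own choices: the helper names, the number of quadrature nodes, and the test potential, a quadratic $V$ that changes the variance of the projection from 1 to $1+A$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def make_gaussian_potential(A):
    """Quadratic potential V with exp(-V) turning unit variance into variance 1 + A."""
    def V(lam):
        return 0.5 * np.log1p(A) - A * lam**2 / (2.0 * (1.0 + A))
    return V

def F(q, V, lam=1.0, gam=0.0, n_nodes=80):
    """F_{lam,gam}(q) = int Dx int Dy exp(-lam V(x) - (1-gam) V(x q + y sqrt(1-q^2)))."""
    nodes, weights = hermegauss(n_nodes)       # nodes/weights for weight exp(-x^2/2)
    weights = weights / np.sqrt(2.0 * np.pi)   # normalise to the Gaussian measure Dx
    x = nodes[:, None]
    y = nodes[None, :]
    integrand = np.exp(-lam * V(x) - (1.0 - gam) * V(x * q + y * np.sqrt(1.0 - q**2)))
    return float(weights @ integrand @ weights)

A = -0.5
V = make_gaussian_potential(A)
print(F(0.0, V), F(0.6, V))
```

For $\lambda = 1$, $\gamma = 0$ and $q = 0$ the integral factorizes into two normalization integrals, so $F_{1,0}(0) = 1$ for any normalized potential. This is a convenient sanity check for the quadrature.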

To show the phenomenon of retarded learning we compare the cumulative risk of
the Bayes estimator to the risk of a trivial estimator which assumes that there is no specific
structure in the data and always predicts with the spherical distribution
$\hat P_t^{\rm triv}(x) = P_0(x)$, thereby achieving the trivial total risk
$R_m^{\rm triv} = R_m(\hat P^{\rm triv}) = -m\overline{V}$. Note that $\overline{V}$ is a
nonpositive quantity, since $-\overline{V}$ is the KL divergence between $P(x|w^*)$ and $P_0(x)$.
We are interested in the difference $\Delta R_m = R_m^{\rm triv} - R_m^{\rm Bayes}$ between the
trivial risk and the Bayes risk. Taking the thermodynamic limit $N \to \infty$ with $\alpha = m/N$
fixed, we can evaluate the integral in (6) by Laplace's method, which selects the maximum of the
exponent over $q$ and gives the asymptotic upper bound for $\Delta R_m$

$$\limsup_{N\to\infty} \frac{\Delta R_{\alpha N}}{N} \le \max_q \left\{ \frac{1}{2}\ln(1 - q^2) + \alpha \ln F_{\lambda,\gamma}(q) \right\} + \alpha(\lambda - \gamma - 1)\overline{V} \tag{7}$$

For sufficiently small $\alpha$, the bound (7) is optimized for $\gamma = 0$ and $\lambda = 1$. For any
potential $V$ having the property $\overline{\lambda} = 0$ (i.e. when the problem is not trivially
learnable by computing the mean of the data) there is a critical value
$\alpha_{lb} = (1 - \overline{\lambda^2})^{-2}$ such that as long as $\alpha < \alpha_{lb}$ the
maximizer is $q = 0$ and $\lim_{N\to\infty} \Delta R_{\alpha N}/N \le 0$ (see Fig. 1). Since the Bayes
risk is minimal, we have $\Delta R_m \ge 0$ and we conclude that
$\lim_{N\to\infty} \Delta R_{\alpha N}/N = 0$ at least for $\alpha < \alpha_{lb}$. This
proves the existence of a region of retarded learning, where even the risk of the optimal
Bayes estimator is, to leading order in $N$, as large as the risk of a trivial estimator which
assumes that there is no spatial structure at all. The bound $\alpha_{lb}$ agrees with the critical
$\alpha$ obtained in the replica analysis of [?].
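The onset of a nonzero maximizer can be made concrete for a potential that changes the variance of the projection from 1 to $1+A$. For that case the Gaussian integral gives $F_{1,0}(q) = (1 - A^2q^2)^{-1/2}$; this closed form is our own evaluation under that assumption, valid for $A^2q^2 < 1$. The sketch scans the exponent of the upper bound on a grid (the function names and grid size are illustrative choices).

```python
import numpy as np

def upper_bound_exponent(q, alpha, A):
    """0.5 ln(1 - q^2) + alpha ln F_{1,0}(q) with the assumed Gaussian closed form
    F_{1,0}(q) = (1 - A^2 q^2)^(-1/2). The last term of the bound vanishes
    for lambda = 1, gamma = 0."""
    return 0.5 * np.log(1.0 - q**2) - 0.5 * alpha * np.log(1.0 - (A * q) ** 2)

def maximizer(alpha, A, n_grid=20001):
    """Grid search for the overlap q in [0, 1) that maximizes the exponent."""
    q = np.linspace(0.0, 0.999, n_grid)
    return float(q[np.argmax(upper_bound_exponent(q, alpha, A))])

A = -0.5
alpha_lb = 1.0 / A**2                    # (1 - mean(lambda^2))^(-2) with mean(lambda^2) = 1 + A
for alpha in (3.0, 6.0):
    print(alpha, maximizer(alpha, A))
```

Below $\alpha_{lb}$ the exponent is maximal at $q = 0$, where it vanishes. Above $\alpha_{lb}$ the stationarity condition for this closed form gives $q^2 = (\alpha A^2 - 1)/(A^2(\alpha - 1))$, so the maximizer moves continuously away from zero.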

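We next derive a lower bound on $\Delta R_m$. We use the fact that $R_m^{\rm Bayes}$ is the minimum
cumulative risk over any choice of estimators, so that $R_m^{\rm Bayes} \le R_m(\hat Q)$ for any
distribution $Q(x|w)$ and estimator

$$\hat Q_t(x|x^t) = \frac{\int d\mu(w)\, Q(x^t|w)\, Q(x|w)}{\int d\mu(w)\, Q(x^t|w)}$$

Applying (5) with $H = \ln[P(x^m|w^*)/Q(x^m|w)]$ and restricting ourselves to the class of trial
Hamiltonians $H_0$ which do not depend on $x^m$, it can be shown that the optimal choice is the data
average $H_0 = \int dx^m\, P(x^m|w^*)\, H$, for which, on average, the correction term in (5)
vanishes. This yields

$$\begin{aligned}
R_m^{\rm Bayes} \le R_m(\hat Q) &= -\int d\mu(w^*)\, dx^m\, P(x^m|w^*)\, \ln \frac{\int d\mu(w)\, Q(x^m|w)}{P(x^m|w^*)} \\
&\le -\int d\mu(w^*)\, \ln \int d\mu(w)\, \exp\left[ -m \int dx\, P(x|w^*)\, \ln \frac{P(x|w^*)}{Q(x|w)} \right]
\end{aligned} \tag{8}$$

We now have to find a good choice for $Q(x|w)$. For $Q(x|w)$ with a structure of the form
(1), we set $Q(x|w) = P_0(x)\exp(-U(w\cdot x))$. In the thermodynamic limit we get the lower
bound for $\Delta R_m$

$$\liminf_{N\to\infty} \frac{\Delta R_{\alpha N}}{N} \ge \max_q \left\{ \frac{1}{2}\ln(1 - q^2) - \alpha \int Dx\, \exp(-U_q(x))\, U(x) \right\} \tag{9}$$

with

$$U_q(x) = -\ln \int Dy\, \exp\left[ -V\!\left(xq + y\sqrt{1 - q^2}\right) \right] \tag{10}$$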



It is easy to see that for any $q$, the expression in the curly brackets of (9) is maximised for
$U(x) = U_q(x)$. With this choice for $U$, we find that there exists an $\alpha_{ub}$ such that for
$\alpha > \alpha_{ub}$ we have $\lim_{N\to\infty} \Delta R_{\alpha N}/N > 0$, which means that now the
Bayes estimator performs better than the trivial estimator and a nontrivial estimation of the
direction $w^*$ is possible (see Fig. 1). $\alpha_{ub}$ gives an upper bound on the region of
retarded learning but has no simple analytical expression.
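Although $\alpha_{ub}$ has no simple closed form, it is easy to locate numerically once the bracket in (9) is known. The sketch below assumes a potential that changes the variance of the projection from 1 to $1+A$. With $U = U_q$ the integral in (9) then equals minus the KL divergence between $N(0, 1 + Aq^2)$ and $N(0, 1)$; this closed form is our own evaluation under that assumption, not a formula quoted from the text, and the function names and grids are illustrative.

```python
import numpy as np

def gaussian_kl(q, A):
    """KL divergence between N(0, 1 + A q^2) and N(0, 1)."""
    s = 1.0 + A * q**2                  # variance of the projection w.x at overlap q
    return 0.5 * (s - 1.0 - np.log(s))

def lower_bound_exponent(q, alpha, A):
    """0.5 ln(1 - q^2) - alpha * int Dx exp(-U_q) U_q under the Gaussian assumption."""
    return 0.5 * np.log(1.0 - q**2) + alpha * gaussian_kl(q, A)

def alpha_ub(A, n_grid=200001):
    """Smallest alpha for which the exponent is positive for some q in (0, 1)."""
    q = np.linspace(1e-3, 0.9999, n_grid)
    return float(np.min(-0.5 * np.log(1.0 - q**2) / gaussian_kl(q, A)))

A = -0.5
a_ub = alpha_ub(A)
q_grid = np.linspace(0.0, 0.999, 1000)
below = float(np.max(lower_bound_exponent(q_grid, 0.99 * a_ub, A)))
above = float(np.max(lower_bound_exponent(q_grid, 1.01 * a_ub, A)))
print(a_ub, below, above)
```

In this sketch the exponent always has a local maximum at $q = 0$, so the bound becomes positive through a jump of the maximizer to a finite overlap rather than through a continuous bifurcation.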

Our approach is also easily applied to a discrete prior distribution, e.g. a uniform
distribution on the hypercube [?]. Again, for small $\alpha$ we find a region of retarded learning
with $\lim_{N\to\infty} \Delta R_{\alpha N}/N = 0$ at least for $\alpha < \alpha_{lb}$, where
$\alpha_{lb}$ is exactly the same as for the spherical prior.

For illustration, we apply the bounds to the simple case of a Gaussian distribution for
which the integrals can be done analytically. Other distributions will be discussed in [?]. We
set $P(x|w^*) = (2\pi)^{-N/2}(1+A)^{-1/2} \exp\left( -\frac{x^2}{2} + \frac{A}{2(1+A)}(w^*\cdot x)^2 \right)$.
The data are normally distributed with unit variance in all directions perpendicular to $w^*$ and
with variance $1 + A$ in the direction $w^*$. The upper and lower bounds (7) and (9) (optimized
w.r.t. $\lambda$ and $\gamma$) are shown in Figure 1 for $A = -0.5$, for which we obtain
$\alpha_{lb} = 1/A^2 = 4$. We have compared the bounds with numerical simulations. Since it is hard
to compute the Bayes optimal estimate algorithmically, we have used the following (suboptimal)
algorithm instead. We have computed the estimate $\hat w(x^t)$ of $w^*$ which maximises the
posterior probability given the data. The estimate of the distribution is given by the plugin
estimator $\hat P_t^{\rm plugin}(x|x^t) = P(x|\hat w(x^t))$, which has the KL divergence

$$\int dx\, P(x|w^*)\, \ln \frac{P(x|w^*)}{\hat P_t^{\rm plugin}(x|x^t)} = \frac{A^2}{2(1+A)} \left( 1 - (w^*\cdot \hat w(x^t))^2 \right)$$

and the cumulative entropic risk $R_m(\hat P^{\rm plugin})$ can be easily approximated numerically
by averaging over a large number of data sets. Figure 1 shows the difference
$R_m(\hat P^{\rm triv}) - R_m(\hat P^{\rm plugin})$. Since the Bayes risk is minimal, the upper bound
on $\Delta R_m$ is also an upper bound for every estimator, while the lower bound is only a lower
bound for the Bayes estimator. We see that until $\alpha \approx 4$, the difference is negative
and decreases linearly. Since the slope of the dash-dotted curve is proportional to the (relative)
instantaneous loss (i.e. the noncumulative risk), this means that while the plugin estimator is in
the retarded learning regime, its instantaneous loss is even bigger than that of the trivial
estimator. This is due to the fact that the plugin estimator has to keep the elliptic
form of the distribution $P(x|\hat w)$, which is always different from spherical when
$\hat w(x^t) \neq 0$. For $4 < \alpha < 7.5$, the performance of the plugin estimator improves: the
slope increases but is still negative. For $\alpha > 7.5$, the plugin estimator performs better
than the trivial one and the curve starts to increase. The Bayes estimator does not have this
kind of disadvantage and can take a form closer to the spherical distribution in the retarded
learning region by smoothing over the parameters $w$.

Figure 1: Upper and lower bound for the risk difference
$\Delta R_m/N = (R_m^{\rm triv} - R_m^{\rm Bayes})/N$ in the limit $N \to \infty$ for the Gaussian
case $A = -0.5$, with $\alpha_{lb} = 4$. Simulations are for the plugin estimator and show
$(R_m(\hat P^{\rm triv}) - R_m(\hat P^{\rm plugin}))/N$ for $N = 100$ averaged over 50 data sets.
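A single step of such a simulation can be sketched as follows. This is a minimal illustration under our own assumptions, not the code used for Figure 1: with a uniform prior on the sphere the posterior maximum coincides with the maximum likelihood direction, which for this Gaussian model is an extreme eigenvector of the empirical second-moment matrix (smallest eigenvalue for $A < 0$, largest for $A > 0$). The sizes `N` and `m` and the seed are arbitrary.

```python
import numpy as np

def sample_data(m, w_star, A, rng):
    """Draw m points with unit variance orthogonal to w_star and variance 1 + A along it."""
    z = rng.standard_normal((m, w_star.size))
    proj = z @ w_star
    # rescale only the component along w_star by sqrt(1 + A)
    return z + np.outer((np.sqrt(1.0 + A) - 1.0) * proj, w_star)

def ml_direction(x, A):
    """Unit vector maximising the likelihood of the Gaussian model."""
    _, vecs = np.linalg.eigh(x.T @ x / x.shape[0])   # eigenvalues in ascending order
    return vecs[:, -1] if A > 0 else vecs[:, 0]

def plugin_kl(w_hat, w_star, A):
    """KL divergence between the true density and the plugin estimate."""
    return A**2 / (2.0 * (1.0 + A)) * (1.0 - (w_star @ w_hat) ** 2)

rng = np.random.default_rng(1)
N, m, A = 20, 2000, -0.5                 # illustrative sizes, alpha = m / N = 100
w_star = np.zeros(N)
w_star[0] = 1.0
x = sample_data(m, w_star, A, rng)
w_hat = ml_direction(x, A)
overlap_sq = float((w_star @ w_hat) ** 2)
kl = float(plugin_kl(w_hat, w_star, A))
print(overlap_sq, kl)
```

Summing `plugin_kl` over growing prefixes of the data and averaging over independent data sets gives a Monte Carlo estimate of the cumulative risk of the plugin estimator.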

Both bounds in Figure 1 show the same type of asymptotic growth for $\alpha \to \infty$. The
asymptotics can be found analytically by expanding both bounds (7) and (9) for $q \to 1$.
For a smooth potential $V$ both bounds give asymptotically the same logarithmic growth
$R^{\rm Bayes}_{\alpha N}/N \simeq \frac{1}{2}\ln\alpha$, which can also be obtained by well known
asymptotic expansions involving the Fisher information matrix [?], [25], [11]. On the other hand,
our bounds can also be used when these standard asymptotic expansions do not apply, e.g. when the
potential exhibits a discontinuity [?] of the form $V(\lambda) = -\ln 2\Theta(\lambda) + U(\lambda)$
with corresponding projection distribution
$p(\lambda) = \Theta(\lambda)\frac{2}{\sqrt{2\pi}} \exp(-\lambda^2/2 - U(\lambda))$, where
$\Theta(\lambda)$ is the Heaviside function and $U$ is smooth. Our bounds yield the asymptotic
scaling $R^{\rm Bayes}_{\alpha N}/N \simeq \ln\alpha$.

The asymptotic matching of our bounds has an important consequence for computing
asymptotically good approximations to Bayesian predictions. Such approximations are easily
derived for smooth potentials by local expansions of the posterior distribution around its
maximum [?]. However, this technique will obviously fail for nonsmooth potentials. On
the other hand, our results show that the estimate $\hat Q$ in (8), which uses the smooth
optimizing potential $U_q$ of (10), has the same asymptotic performance as the Bayes optimal
estimate $\hat P^{\rm Bayes}$. The smoothness of $U_q$ again enables local expansions. For example,
following (10), the case $V(\lambda) = -\ln 2\Theta(\lambda)$ can be well estimated using
$U_q(\lambda) = -\ln 2H(-q\lambda/\sqrt{1 - q^2})$ with $H(x) = \int_x^\infty Dt$, where the
maximizer $q$ of the right hand side of (9) is a function of $\alpha$.
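The smoothed potential for the step-function case is simple to evaluate with the complementary error function. The sketch below is illustrative (function names and node count are our own choices). It checks two properties that follow from the definition of $U_q$: the density $e^{-U_q}$ stays normalized under the Gaussian measure for every $q$, and $U_q$ vanishes at $q = 0$.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def H(x):
    """Gaussian tail probability H(x) = int_x^infty Dt."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def density_ratio(lam, q):
    """exp(-U_q(lam)) = 2 H(-q lam / sqrt(1 - q^2)) for V = -ln 2 Theta."""
    return 2.0 * H(-q * lam / math.sqrt(1.0 - q * q))

def U_q(lam, q):
    """Smooth potential U_q(lam); differentiable in lam although V has a jump."""
    return -math.log(density_ratio(lam, q))

def normalisation(q, n_nodes=60):
    """int Dx exp(-U_q(x)) by Gauss-Hermite quadrature; should equal 1."""
    nodes, weights = hermegauss(n_nodes)
    vals = np.array([density_ratio(x, q) for x in nodes])
    return float(weights @ vals / math.sqrt(2.0 * math.pi))

for q in (0.0, 0.5, 0.9):
    print(q, normalisation(q), U_q(1.0, q), U_q(-1.0, q))
```

As $q$ approaches 1 the smooth potential approaches the step: it becomes large for negative arguments, and for extreme arguments the tail probability underflows, so `U_q` should only be evaluated where `density_ratio` is nonzero.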

In this letter, we have put the phenomenon of retarded learning first established by the 
replica method on a rigorous footing. Our method relies on a general information theoretic 
performance measure for learning probability distributions which is related to the free energy 
of statistical physics. A variational principle yields a controlled approximation to this
quantity by providing exact upper and lower bounds which are valid for arbitrary dimensionality
of the problem. We expect that this framework is flexible enough to be applicable to more 
complex and realistic probabilistic models. It may also be useful for constructing criteria 
that help to decide if structures estimated from a dataset in a high dimensional space reflect 
a real feature of the underlying data generating mechanism or if the result is expected to be 
a spurious effect of random fluctuations. 

We are grateful to J.-P. Nadal for helpful discussions.